dplyr

Haley Jeppson, Joe Papio,
Sam Tyner

June 14, 2017

Numeric Summaries with dplyr

Baseball Data

  • We would like to create career summary statistics for each player
  • Plan: subset on a player, and compute statistics
ss <- subset(baseball, id == "sosasa01")
head(ss)
##             id year stint team lg   g  ab  r   h X2b X3b hr rbi sb cs bb
## 66822 sosasa01 1989     1  TEX AL  25  84  8  20   3   0  1   3  0  2  0
## 66823 sosasa01 1989     2  CHA AL  33  99 19  27   5   0  3  10  7  3 11
## 67907 sosasa01 1990     1  CHA AL 153 532 72 124  26  10 15  70 32 16 33
## 69018 sosasa01 1991     1  CHA AL 116 316 39  64  10   1 10  33 13  6 14
## 70599 sosasa01 1992     1  CHN NL  67 262 41  68   7   2  8  25 15  7 19
## 71757 sosasa01 1993     1  CHN NL 159 598 92 156  25   5 33  93 36 11 38
##        so ibb hbp sh sf gidp
## 66822  20   0   0  4  0    3
## 66823  27   2   2  1  2    3
## 67907 150   4   6  2  6   10
## 69018  98   2   2  5  1    5
## 70599  63   1   4  4  2    4
## 71757 135   6   4  0  1   14
mean(ss$h / ss$ab)
## [1] 0.2681506

We need an automatic way to calculate this for every player.

for loops

  • Idea: repeat the same (set of) statement(s) for each element of an index set
  • Setup:
    • Introduce counter variable (sometimes named i)
    • Reserve space for results
  • Generic Code:
result <- rep(NA, length(indexset))
for(i in indexset){
  ... some statements ...
  result[i] <- ...
}

for loops for Baseball

  • Index set: player id
  • Setup:
players <- unique(baseball$id)
n <- length(players)

ba <- rep(NA, n)

for(i in 1:n) {
  career <- subset(baseball, id == players[i])
  ba[i] <- with(career, mean(h / ab, na.rm = TRUE))
}

summary(ba)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1831  0.2459  0.2231  0.2699  0.5000       6

for loops for Baseball

  • Index set: player id
  • i = 0:
players <- unique(baseball$id)
n <- length(players)

ba <- rep(NA, n)

head(ba)
## [1] NA NA NA NA NA NA

for loops for Baseball

  • Index set: player id
  • i = 1:
players <- unique(baseball$id)

ba <- rep(NA, length(players))

for(i in 1:1) {
  career <- subset(baseball, id == players[i])
  ba[i] <- with(career, mean(h / ab, na.rm = TRUE))
}

head(ba)
## [1] 0.3371163        NA        NA        NA        NA        NA

for loops for Baseball

  • Index set: player id
  • i = 2:
players <- unique(baseball$id)

ba <- rep(NA, length(players))

for(i in 1:2) {
  career <- subset(baseball, id == players[i])
  ba[i] <- with(career, mean(h / ab, na.rm = TRUE))
}

head(ba)
## [1] 0.3371163 0.2489226        NA        NA        NA        NA

Your Turn

  • MLB rules for the all-time greatest hitters require that a player has played at least 1000 games, with at least as many at-bats, to be considered
  • Extend the for loop above to collect the additional information
  • Introduce and collect data for two new variables: games and atbats

How did it go? What was difficult?

  • Housekeeping (declaring variables, updating values in each iteration) distracts from the real work
  • Indices are error-prone
  • Loops often produce slow code, because R is optimized to operate on entire vectors at once
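The vectorization point is easy to see on a tiny example (a minimal sketch, not from the slides): the loop and the single vectorized call give the same result, but the vectorized form needs no counter and no preallocated result vector.

```r
x <- c(1, 4, 9)

# loop version: counter, preallocated result, explicit indexing
result <- rep(NA, length(x))
for (i in seq_along(x)) {
  result[i] <- sqrt(x[i])
}

# vectorized version: one call does all the bookkeeping
sqrt(x)
```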

The dplyr package

The pipe operator %>%

f(x) %>% g(y) is equivalent to g(f(x), y)

i.e. the output of one function is used as input to the next function. The first function can be the identity, which is why a pipe can start with a plain object.

Consequences:

  • x %>% f(y) is the same as f(x, y)
  • statements of the form k(h(g(f(x, y), z), u), v, w) become x %>% f(y) %>% g(z) %>% h(u) %>% k(v, w)
  • read %>% as “then do”
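A small base-R illustration of this rewriting rule (the numbers are invented, just to show the mechanics; %>% is attached by library(tidyverse) or library(magrittr)):

```r
library(magrittr)  # provides %>% (also attached with the tidyverse)

# nested form: must be read inside-out
round(exp(sqrt(16)), 1)

# piped form: read left to right as "take 16, then sqrt, then exp, then round"
16 %>% sqrt() %>% exp() %>% round(1)
```

Both forms return the same value; the pipe just spells out the order of operations in reading order.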

dplyr verbs

There are five primary dplyr verbs, representing distinct data analysis tasks:

  • Filter: Select specified rows of a data frame, produce subsets
  • Arrange: Reorder the rows of a data frame
  • Select: Select particular columns of a data frame
  • Mutate: Add new or change existing columns of the data frame (as functions of existing columns)
  • Summarise: Create collapsed summaries of a data frame
  • (Group By: Introduce structure to a data frame)

Filter

library(tidyverse)
data(french_fries, package = "reshape2")
french_fries %>%
    filter(subject == 3, time == 1)
##   time treatment subject rep potato buttery grassy rancid painty
## 1    1         1       3   1    2.9     0.0    0.0    0.0    5.5
## 2    1         1       3   2   14.0     0.0    0.0    1.1    0.0
## 3    1         2       3   1   13.9     0.0    0.0    3.9    0.0
## 4    1         2       3   2   13.4     0.1    0.0    1.5    0.0
## 5    1         3       3   1   14.1     0.0    0.0    1.1    0.0
## 6    1         3       3   2    9.5     0.0    0.6    2.8    0.0

Look at ?reshape2::french_fries to learn more about the data

filter is similar to the base function subset

Multiple conditions in filter are combined with a logical AND, i.e. all conditions must hold

Explicit logical expressions work too, e.g. filter(subject == 3 & time == 1)
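The correspondence with subset can be checked directly (a sketch on a toy data frame, so it runs without the french_fries data):

```r
library(dplyr)

df <- data.frame(subject = c(3, 3, 10), time = c(1, 2, 1))

a <- subset(df, subject == 3 & time == 1)  # base R
b <- filter(df, subject == 3, time == 1)   # dplyr, conditions combined with AND

nrow(a) == nrow(b)  # both keep exactly the matching rows
```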

Arrange

french_fries %>%
    arrange(desc(rancid), potato) %>%
    head
##   time treatment subject rep potato buttery grassy rancid painty
## 1    9         2      51   1    7.3     2.3      0   14.9    0.1
## 2   10         1      86   2    0.7     0.0      0   14.3   13.1
## 3    5         2      63   1    4.4     0.0      0   13.8    0.6
## 4    9         2      63   1    1.8     0.0      0   13.7   12.3
## 5    5         2      19   2    5.5     4.7      0   13.4    4.6
## 6    9         1      63   1    1.7     0.0      0   13.3   12.8

Successive variables are used to break ties left by the preceding variables.

Select

french_fries %>%
    select(time, treatment, subject, rep, potato) %>%
    head
##    time treatment subject rep potato
## 61    1         1       3   1    2.9
## 25    1         1       3   2   14.0
## 62    1         1      10   1   11.0
## 26    1         1      10   2    9.9
## 63    1         1      15   1    1.2
## 27    1         1      15   2    8.8

Summarise

french_fries %>%
    summarise(mean_rancid = mean(rancid, na.rm=TRUE), 
              sd_rancid = sd(rancid, na.rm = TRUE))
##   mean_rancid sd_rancid
## 1     3.85223  3.781815

Summarise and Group_by

french_fries %>%
    group_by(time, treatment) %>%
    summarise(mean_rancid = mean(rancid), sd_rancid = sd(rancid))
## Source: local data frame [30 x 4]
## Groups: time [?]
## 
## # A tibble: 30 x 4
##      time treatment mean_rancid sd_rancid
##    <fctr>    <fctr>       <dbl>     <dbl>
##  1      1         1    2.758333  3.212870
##  2      1         2    1.716667  2.714801
##  3      1         3    2.600000  3.202037
##  4      2         1    3.900000  4.374730
##  5      2         2    2.141667  3.117540
##  6      2         3    2.495833  3.378767
##  7      3         1    4.650000  3.933358
##  8      3         2    2.895833  3.773532
##  9      3         3    3.600000  3.592867
## 10      4         1    2.079167  2.394737
## # ... with 20 more rows

Mutate

Change an existing variable or create a new one in the data

french_fries %>%
    mutate(
      awful = (buttery+potato)/2 - (grassy+painty+rancid)/3
    ) %>% glimpse()
## Observations: 696
## Variables: 10
## $ time      <fctr> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ treatment <fctr> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ subject   <fctr> 3, 3, 10, 10, 15, 15, 16, 16, 19, 19, 31, 31, 51, 5...
## $ rep       <dbl> 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2...
## $ potato    <dbl> 2.9, 14.0, 11.0, 9.9, 1.2, 8.8, 9.0, 8.2, 7.0, 13.0,...
## $ buttery   <dbl> 0.0, 0.0, 6.4, 5.9, 0.1, 3.0, 2.6, 4.4, 3.2, 0.0, 0....
## $ grassy    <dbl> 0.0, 0.0, 0.0, 2.9, 0.0, 3.6, 0.4, 0.3, 0.0, 3.1, 0....
## $ rancid    <dbl> 0.0, 1.1, 0.0, 2.2, 1.1, 1.5, 0.1, 1.4, 4.9, 4.3, 2....
## $ painty    <dbl> 5.5, 0.0, 0.0, 0.0, 5.1, 2.3, 0.2, 4.0, 3.2, 10.3, 2...
## $ awful     <dbl> -0.3833333, 6.6333333, 8.7000000, 6.2000000, -1.4166...

Careful, trap!

Why does

french_fries$awful
## NULL

not return the variable we just created?

  1. Because we never saved it back into the french_fries data
  2. Go back and have a look ;)

mutate or summarize?

Both commands introduce new variables - so which one should we use?

Differences:

  • mutate adds variables to the existing data set: the resulting variables must have the same length as the original data, e.g. use for transformations, combinations of multiple variables
  • summarize creates aggregates of the original data. The number of rows of the new dataset is determined by the number of combinations of the grouping structure. The number of columns is determined by the number of grouping variables and the summary statistics.
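The difference in shape is easiest to see on a toy data frame (hypothetical data, a sketch):

```r
library(dplyr)

df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 4))

# mutate: result keeps the same 3 rows, old columns plus the new one
df %>% mutate(x2 = 2 * x)

# summarize: result collapses to one row per group,
# keeping only grouping variables and the summaries
df %>% group_by(g) %>% summarize(mx = mean(x))
```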

Let’s use these tools

to answer these french fry experiment questions:

  • Is the design complete?
  • Are replicates like each other?
  • How do the ratings on the different scales differ?
  • Are raters giving different scores on average?
  • Do ratings change over the weeks?

Completeness

If the data is complete it should be 12 (subjects) x 10 (weeks) x 3 (treatments) x 2 (replicates), that is, 6 records for each person in each week. (Assuming that each person rated on all scales.)

To check this, we want to tabulate the number of records for each subject, time and treatment: select the appropriate columns, count the records, and spread the counts out into a convenient table.
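One way to sketch such a table is count followed by tidyr's spread, shown here on a hypothetical mini version of the data (names and values invented for illustration):

```r
library(dplyr)
library(tidyr)

# toy stand-in: 2 subjects, 2 weeks, a few records each
df <- data.frame(
  subject = c(3, 3, 3, 10, 10),
  time    = c(1, 1, 2, 1, 2)
)

# count records per subject and week, then spread weeks into columns
df %>% count(subject, time) %>% spread(time, n)
```

Each row is then a subject, each column a week, and each cell the number of records, which makes gaps easy to spot.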

Visual approach: completeness

french_fries %>% 
  ggplot(aes(x = time)) + geom_bar() + facet_wrap(~subject) 

Numeric summary of completeness

dim(french_fries) # too few rows - should be 720 = 12 x 10 x 6
## [1] 696   9
# if subjects come, they give all six evaluations
french_fries %>% group_by(time, subject) %>% summarize(n = n()) %>% summary()
##       time       subject         n    
##  1      :12   10     :10   Min.   :6  
##  2      :12   15     :10   1st Qu.:6  
##  3      :12   16     :10   Median :6  
##  4      :12   19     :10   Mean   :6  
##  5      :12   51     :10   3rd Qu.:6  
##  6      :12   52     :10   Max.   :6  
##  (Other):44   (Other):56

Why are all of the ns equal to 6?

Numeric summary of completeness (cont’d)

# not all subjects come all the time:
french_fries %>% group_by(subject) %>% summarize(n = n()) %>% arrange(n)
## # A tibble: 12 x 2
##    subject     n
##     <fctr> <int>
##  1       3    54
##  2      31    54
##  3      79    54
##  4      86    54
##  5      10    60
##  6      15    60
##  7      16    60
##  8      19    60
##  9      51    60
## 10      52    60
## 11      63    60
## 12      78    60

Shortcuts

  • summarize(n = n()) is equivalent to tally()
  • group_by(time, subject) %>% summarize(n = n()) is equivalent to count(time, subject)
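Both shortcuts can be verified on a small example (a sketch on toy data, so it runs standalone):

```r
library(dplyr)

df <- data.frame(time = c(1, 1, 2), subject = c(3, 3, 3))

long  <- df %>% group_by(time, subject) %>% summarize(n = n())
short <- df %>% count(time, subject)

all(long$n == short$n)  # identical counts, much less typing with count()
```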

Are replicates similar?

reps <- french_fries %>% group_by(time, subject, treatment) %>%
  summarise(
    potato_diff = diff(potato),
    potato = mean(potato)
  )
reps
## Source: local data frame [348 x 5]
## Groups: time, subject [?]
## 
## # A tibble: 348 x 5
##      time subject treatment potato_diff potato
##    <fctr>  <fctr>    <fctr>       <dbl>  <dbl>
##  1      1       3         1        11.1   8.45
##  2      1       3         2        -0.5  13.65
##  3      1       3         3        -4.6  11.80
##  4      1      10         1        -1.1  10.45
##  5      1      10         2         1.7  10.15
##  6      1      10         3        -1.2  10.70
##  7      1      15         1         7.6   5.00
##  8      1      15         2        -2.0   8.00
##  9      1      15         3         2.2   6.90
## 10      1      16         1        -0.8   8.60
## # ... with 338 more rows

Are replicates similar? (cont’d)

reps %>% 
  ggplot(aes(x = potato, y = potato_diff, colour = as.numeric(time))) + 
  facet_wrap(~subject) +
  geom_hline(aes(yintercept=0)) +
  geom_point() 

Your Turn (10 min)

Try to answer (a part of) the question: are different ratings similar?

Note: there are many different ways of answering this question. Choose one numeric and one visual approach

Your Turn - One Solution

french_fries %>% 
  ggplot(aes(x = potato, y = buttery)) + geom_point() +
  theme(aspect.ratio=1) + xlim(c(0,15)) + ylim(c(0,15)) +
  geom_abline(colour = "grey50")

Your Turn - One Solution (cont’d)

The package GGally has an implementation of a scatterplot matrix using ggplot2:

GGally::ggpairs(data = french_fries[ ,5:9])

For the numeric approach of a summary we could compute means across subjects for each week and compare those values:

ffm <- french_fries %>% group_by(time) %>% summarise(
  potato = mean(potato,  na.rm=TRUE),
  buttery = mean(buttery,  na.rm=TRUE),
  painty = mean(painty,  na.rm=TRUE)
)

Your Turn - One Solution (cont’d)

ffm %>%
  ggplot(aes(x = time, y = potato)) + geom_point(colour = "blue") +
  geom_point(aes(y = buttery), colour = "forestgreen") +
  geom_point(aes(y = painty), colour = "red") +
  ylab("Score")

This doesn’t look like the most elegant or most efficient way of answering the question: the data is in an awkward form!

Another Your Turn (10 min)

The dataset ChickWeight is part of the core packages that come with R (i.e. data(ChickWeight) gets the data into your active session). From the help file:

There were four groups of chicks on different protein diets. The body weights of the chicks were measured at birth and every second day thereafter until day 20. They were also measured on day 21.

Your Turn (10 min)

  1. Introduce a variable gain into the data that keeps track of how much weight each chick has gained since Time 0 (i.e. gain at time 0 should be 0). Plot weight gain over time. Connect the observed weight gains for each chick by a line. Facet by diet.
  2. Focus on weight gains on day 21. Draw side-by-side dotplots of weight gains by diet. Summarize the average weight gain on day 21 under each diet. Overlay the dotplots by error bars around the average weight gain under each diet (see ?geom_errorbar)

Re-cap

  • getting used to dplyr actions takes a bit of time and practice
  • recognize keywords and match them to dplyr functions
  • make a conscious effort to apply dplyr functions in your regular workflow: that means doing some things differently at first - the long-term benefits are there, promise!